Introduction

This data set comprises case counts and infection rates for sexually transmitted diseases (chlamydia, gonorrhea, and early syphilis, which covers primary, secondary, and early latent syphilis) reported for California residents. The data is broken down by disease, county, year, and sex.

The data was collected for cases with estimated diagnosis dates spanning from 2001 up to the most recent year available. It was sourced from California Confidential Morbidity Reports and Laboratory Reports, all of which were submitted to the California Department of Public Health (CDPH) by July of the current year. These reports adhered to the surveillance case definition for each respective disease.

After exploring the data, the main question we set out to investigate was: Which STD has the highest prevalence in California, and how is this disease geographically spread across the state? Further analysis identified the year with the highest STD rates and compared infection rates by sex.

You can download the report by clicking the “Download the report” button at the top of the page.

Methods

Data cleaning and wrangling

  1. Merge the STD and geographic data sets. The STD data did not include latitude and longitude coordinates, so we merged it with a second, geographic data set to make a geographic analysis possible.

  2. The combined data set has 11 columns. The “Cases” and “Rate” columns contain missing values wherever the “Annotation Code” variable indicates that a value is withheld from publication. Rows with these missing values were removed.

  3. The “Rate” column was stored as chr (character), so we converted it to a numeric type.

  4. The “County” column includes rows labeled “California”, which is the state rather than a county. We saved these statewide aggregate rows in a new variable, “Cali”, and then removed them from the county-level data.
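The four steps above can be sketched in code. The analysis itself was done in R; the Python sketch below only illustrates the logic on a few made-up rows. The column names "Latitude" and "Longitude", the function name clean, and every sample value are assumptions for illustration and are not taken from the real data sets.

```python
# Hypothetical sample rows; every field starts as a string, as it would after
# reading a CSV file. The coordinates are placeholders, not real centroids.
std_rows = [
    {"County": "Alameda", "Year": "2019", "Sex": "Female", "Cases": "10", "Rate": "12.5"},
    {"County": "Alpine", "Year": "2019", "Sex": "Female", "Cases": "", "Rate": ""},  # withheld
    {"County": "California", "Year": "2019", "Sex": "Female", "Cases": "20", "Rate": "15.0"},
]
geo_rows = [
    {"County": "Alameda", "Latitude": "37.65", "Longitude": "-121.91"},
    {"County": "Alpine", "Latitude": "38.60", "Longitude": "-119.82"},
]


def clean(std_rows, geo_rows):
    """Return (county_rows, statewide_rows) after the four cleaning steps."""
    # Step 1: left-join the coordinates onto the STD rows by county name.
    geo_by_county = {g["County"]: g for g in geo_rows}
    merged = []
    for row in std_rows:
        geo = geo_by_county.get(row["County"], {})
        merged.append({
            **row,
            "Latitude": geo.get("Latitude"),
            "Longitude": geo.get("Longitude"),
        })

    # Step 2: drop rows whose Cases or Rate value is missing.
    complete = [r for r in merged if r["Cases"] != "" and r["Rate"] != ""]

    # Step 3: convert Rate from text to a number. In this sketch Cases is
    # also text, so it is converted as well.
    for r in complete:
        r["Cases"] = int(r["Cases"])
        r["Rate"] = float(r["Rate"])

    # Step 4: set the statewide aggregate aside, keep only real counties.
    cali = [r for r in complete if r["County"] == "California"]
    counties = [r for r in complete if r["County"] != "California"]
    return counties, cali


counties, cali = clean(std_rows, geo_rows)
```

Setting the statewide rows aside before filtering them out keeps the aggregate available for year-level summaries, while the county-level rows stay free of double counting.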

Libraries used

The libraries used were: data.table, tidyverse, dplyr, plotly, DT, and knitr.


Results

Year  Sex  Count of Diseases  Cases Avg  Cases SD  Rate Avg  Rate SD
2001 Female 3 28844 41110.26 166.36667 237.10256
2001 Male 3 12791 12039.87 74.50000 70.10193
2001 Total 3 41944 52847.99 121.56667 153.13407
2002 Female 3 30868 44249.78 175.83333 252.04256
2002 Male 3 14604 13451.54 84.03333 77.40390
2002 Total 3 45743 57455.33 130.90000 164.46115
2003 Female 3 32366 46081.94 181.96667 259.13036
2003 Male 3 15597 14607.35 88.60000 82.95330
2003 Total 3 48067 60327.61 135.83333 170.48444
2004 Female 3 34463 48094.33 191.80000 267.70022
2004 Male 3 17449 15802.51 98.10000 88.84329
2004 Total 3 52057 63409.59 145.63333 177.37061
2005 Female 3 36121 49313.19 199.70000 272.60435
2005 Male 3 18986 16880.86 106.06667 94.31905
2005 Total 3 55352 65824.54 153.83333 182.93885
2006 Female 3 37927 52367.21 208.13333 287.41702
2006 Male 3 19636 17662.97 108.93333 98.00573
2006 Total 3 57842 69800.70 159.56667 192.55101
2007 Female 3 38905 55006.27 211.76667 299.39459
2007 Male 3 20132 18946.50 110.76667 104.23657
2007 Total 3 59252 73844.18 162.10000 202.06019
2008 Female 3 38775 57240.62 209.33333 308.97620
2008 Male 3 20524 21116.05 111.96667 115.16550
2008 Total 3 59523 78460.49 161.53333 212.88817
2009 Female 3 37368 56184.72 200.56667 301.55877
2009 Male 3 20881 21624.72 113.20000 117.26291
2009 Total 3 58451 77871.67 157.66667 210.00991
2010 Female 3 39123 58485.69 208.30000 311.38964
2010 Male 3 22656 23108.14 121.93333 124.32990
2010 Total 3 62018 81629.34 165.96667 218.44346
2011 Female 3 41396 62336.84 218.56667 329.17718
2011 Male 3 23917 24238.04 127.46667 129.13173
2011 Total 3 65509 86568.94 173.73333 229.59082
2012 Female 3 42963 63126.24 224.83333 330.33084
2012 Male 3 26461 24785.36 139.66667 130.83273
2012 Total 3 69561 87651.10 182.80000 230.34715
2013 Female 3 42493 61236.09 220.80000 318.22305
2013 Male 3 28291 24720.36 148.10000 129.42523
2013 Total 3 70880 85474.69 184.90000 222.92422
2014 Female 3 43508 61524.78 224.46667 317.43203
2014 Male 3 31852 26810.47 165.33333 139.20206
2014 Total 3 75474 87593.33 195.30000 226.64997
2015 Female 3 47038 65282.62 241.10000 334.60704
2015 Male 3 37263 29645.73 192.06667 152.76784
2015 Total 3 84435 93888.92 217.00000 241.29758
2016 Female 3 48776 65825.26 248.73333 335.61897
2016 Male 3 42253 31958.22 216.46667 163.76515
2016 Total 3 91342 96335.92 233.43333 246.18985
2017 Female 3 53863 71320.95 273.33333 361.91430
2017 Male 3 48562 35759.58 247.46667 182.20319
2017 Total 3 102643 105316.87 260.96667 267.78516
2018 Female 3 57215 74913.26 289.36667 378.84868
2018 Male 3 51592 38244.01 261.83333 194.10565
2018 Total 3 109081 111573.22 276.33333 282.63286
2019 Female 3 58149 75295.94 293.70000 380.32811
2019 Male 3 53140 39700.64 269.33333 201.20331
2019 Total 3 111531 113640.05 282.13333 287.51773
2020 Female 3 46510 55128.15 234.76667 278.32101
2020 Male 3 43287 28498.95 219.40000 144.41731
2020 Total 3 90065 81740.81 227.76667 206.69713
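A summary of this shape can be produced by grouping the rows by year and sex and then taking the mean and standard deviation across the diseases in each group. The Python sketch below assumes that is how the table was built. The analysis itself used R, the sample rows are made up, and the function name summarize is illustrative. It uses the sample standard deviation (n - 1 denominator), which is also what R's sd() computes.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical rows: one per disease, year, and sex.
rows = [
    {"Year": 2001, "Sex": "Female", "Disease": "Chlamydia", "Cases": 30, "Rate": 3.0},
    {"Year": 2001, "Sex": "Female", "Disease": "Gonorrhea", "Cases": 20, "Rate": 2.0},
    {"Year": 2001, "Sex": "Female", "Disease": "Early Syphilis", "Cases": 10, "Rate": 1.0},
    {"Year": 2001, "Sex": "Male", "Disease": "Chlamydia", "Cases": 6, "Rate": 1.5},
    {"Year": 2001, "Sex": "Male", "Disease": "Gonorrhea", "Cases": 4, "Rate": 1.0},
    {"Year": 2001, "Sex": "Male", "Disease": "Early Syphilis", "Cases": 2, "Rate": 0.5},
]


def summarize(rows):
    """Group by (Year, Sex) and summarize Cases and Rate across diseases."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["Year"], r["Sex"])].append(r)

    summary = []
    for (year, sex), members in sorted(groups.items()):
        cases = [m["Cases"] for m in members]
        rates = [m["Rate"] for m in members]
        summary.append({
            "Year": year,
            "Sex": sex,
            "Count of Diseases": len({m["Disease"] for m in members}),
            "Cases Avg": mean(cases),
            "Cases SD": stdev(cases),  # needs at least two rows per group
            "Rate Avg": mean(rates),
            "Rate SD": stdev(rates),
        })
    return summary


summary = summarize(rows)
```

With only three diseases per group, and chlamydia far more common than the other two, a standard deviation larger than the mean (as in most rows of the table) is not surprising.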

Figures and Tables

Figure 1

Figure 2

Figure 3

Figure 4

Sex Cases Population Rate (%)
Female 2292 407916 0.5618804
Male 1032 409452 0.2520442
Total 3324 817368 0.4066712

Table 1
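The rate column in Table 1 is consistent with cases per 100 residents, that is, 100 × Cases / Population. The short Python check below recomputes it from the table's own numbers. The analysis itself used R, and the function name rate_percent is illustrative.

```python
# Values copied from Table 1: sex -> (cases, population, reported rate).
table_1 = {
    "Female": (2292, 407916, 0.5618804),
    "Male": (1032, 409452, 0.2520442),
    "Total": (3324, 817368, 0.4066712),
}


def rate_percent(cases, population):
    """Cases per 100 residents."""
    return 100 * cases / population


recomputed = {sex: rate_percent(c, p) for sex, (c, p, _) in table_1.items()}

# Largest gap between a recomputed and a reported rate (rounding only).
max_gap = max(abs(recomputed[s] - table_1[s][2]) for s in table_1)

# Ratio of the female rate to the male rate.
female_to_male = recomputed["Female"] / recomputed["Male"]
```

The recomputed rates match the reported ones to within rounding, and the female-to-male rate ratio comes out a little above 2.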


Conclusion

Chlamydia was the most prevalent STD in California throughout 2001 to 2020. Statewide infection rates peaked in 2019, and Lake County was the most heavily affected county that year.

A geographic pattern was apparent: the Central Valley reported the highest infection rates, which decreased gradually toward the Nevada border. In addition, a notable disparity by sex was observed in Lake County in 2019, where females reported about twice as many infections as males. This highlights the importance of tailored interventions and awareness initiatives.